In this report I am going to explore a dataset containing quality and other attributes of white wine. By the end of this report, I hope to identify the variables that impact the quality significantly.
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Our dataset contains 12 variables and 4898 observations. It contains no categorical variables.
The quality scores appear to be approximately normally distributed, with most wines having a quality factor of 6.
I can convert the quality variable to a categorical variable, which gives seven levels of quality for our wines.
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Now I can clearly see that there are only 5 wines with a quality factor of 9. I wonder which variable affects the quality of the wine the most?
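The conversion and count described above can be sketched roughly as follows. This is only an illustration on a tiny made-up vector of scores, not the real data; the data frame name `df` follows the code used later in this report, and `quality_counts` is a name introduced here for the example.

```r
# Toy quality scores standing in for the real column (made-up values).
df <- data.frame(quality = c(6L, 5L, 6L, 7L, 3L, 9L, 6L, 5L, 8L, 4L))

# Convert the integer scores to a factor with one level per possible score.
df$quality <- factor(df$quality, levels = 3:9)

# Count the wines at each quality level.
quality_counts <- table(df$quality)
print(quality_counts)
```

Fixing `levels = 3:9` keeps all seven levels in the table even when a score does not occur in the data.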
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The density of the wine doesn’t appear to vary much. With an interquartile range of approximately 0.004, it seems fair to assume that the density of the wine does not affect the quality.
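As a quick check of the interquartile range quoted above, the arithmetic can be reproduced from the quartiles in the summary output. The variable names are introduced here for illustration only.

```r
# Quartiles and median copied from the density summary above.
q1 <- 0.9917
q3 <- 0.9961
density_median <- 0.9937

# Interquartile range, and the same spread relative to the median.
density_iqr <- q3 - q1
relative_spread <- density_iqr / density_median

print(round(density_iqr, 4))            # 0.0044
print(round(100 * relative_spread, 2))  # 0.44 (percent of the median)
```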
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The distribution of alcohol content in the wine appears to be normal. Does higher alcohol content translate to better quality of wine?
The log-transformed plots of the acidity attributes reveal approximately normal distributions. Volatile acidity has the highest spread and citric acid the lowest. Which of these attributes impacts quality the most? It is important to note that the values on the x-axis are in grams per litre, not pH.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Most wines have a pH between 2.9 and 3.5. The median is 3.18 and the maximum is 3.82. The pH of a substance contributes to its “sourness” or “bitterness”: a higher pH corresponds to more bitter wines, while a lower pH corresponds to more sour wines.
Transformed to give a better look at the distribution, the residual sugar data appears to be bimodal, peaking first at about 1.5 and then again at about 10.
Residual sugar refers to the sugar that remains in the wine after fermentation stops. Higher residual sugar would make the wine sweet. Are sweet wines of better quality?
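A log transformation of this kind might look like the sketch below. It uses simulated two-cluster data centred near 1.5 and 10 purely for illustration; the numbers are not the wine measurements, and the bin count of 30 is an arbitrary choice.

```r
set.seed(1)

# Toy right-skewed data with two clusters; NOT the real residual sugar values.
sugar <- c(rlnorm(300, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(200, meanlog = log(10), sdlog = 0.3))

# Bin the values on the raw scale and on the log10 scale.
raw_hist <- hist(sugar, breaks = 30, plot = FALSE)
log_hist <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Draw the log10 histogram; clusters are easier to separate on this scale.
plot(log_hist, main = "Toy residual sugar (log10 scale)",
     xlab = "log10(residual sugar)")
```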
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides in wines have a normal distribution with most wines having chlorides between 0.025 and 0.065.
Free SO2 and total SO2 have normal distributions, with free SO2 peaking at about 30 and total SO2 peaking at about 120. Sulphates also have a normal distribution.
There are 4898 observations of 12 variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The feature quality is a factor variable with the following levels.
(Worst) ————> (Best)
quality: 3, 4, 5, 6, 7, 8, 9
Other observations:
1. Most wines have a quality factor of 6.
2. The interquartile range of the density of the wines is 0.0044.
The main feature of interest is the quality of the wines. I’d like to determine which factors affect the quality of a wine the most. I suspect residual sugar and some other combination of variables could be used to build a predictive model for the quality of wines.
Alcohol content, acidity, and residual sugar are likely to contribute to the quality of a wine. I think acidity and residual sugar contribute the most to quality based on anecdotal evidence.
I did not create any new variable in the dataset, but I converted the quality variable to a categorical factor.
I did not find any unusual data or change any variable in the dataset.
From the plot above, quality is most correlated with alcohol. There is also a notable negative correlation between density and quality. In the previous section, I had assumed that density wouldn’t affect the quality of the wine. Maybe that assumption was wrong. I need to take a closer look at the scatterplots.
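Correlations like the ones read off the plot above can also be computed directly. The sketch below uses simulated data with made-up relationships, so its numbers say nothing about the real wines; the column names mirror the dataset, while `quality_num` and `cors` are names introduced here. Because quality was converted to a factor, it has to be turned back into numbers first.

```r
set.seed(42)
n <- 200

# Toy data with made-up relationships; NOT the wine measurements.
alcohol <- runif(n, 8, 14)
density <- 1.002 - 0.001 * alcohol + rnorm(n, sd = 0.001)
score   <- round(pmin(9, pmax(3, 2 + 0.35 * alcohol + rnorm(n))))
df <- data.frame(alcohol, density, quality = factor(score))

# as.numeric() on a factor returns level codes, so go through as.character()
# to recover the original scores.
quality_num <- as.numeric(as.character(df$quality))

# Pearson correlation of each numeric variable with quality.
cors <- sapply(df[c("alcohol", "density")], function(x) cor(x, quality_num))
print(round(cors, 2))
```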
Comparing density to quality, the first plot suffers from overplotting. Zooming in by changing the limits on the axes, the correlation is still not very apparent. Perhaps some other factors affect the density in such a way that it correlates with quality.
Here I observe a clear positive correlation between alcohol content and quality. Let me investigate further and check whether I get some other strong correlations.
The above plots of quality versus residual sugar do not give us a lot of insight into the quality of the wine. I will continue exploring other variables.
From the plots above I can see a positive correlation between pH and quality, and a negative correlation between chlorides and quality.
Fixed acidity and total sulfur dioxide do not seem to affect quality much.
I will now investigate the relationships between density and other variables to get an insight into the correlation between density and quality.
The plot between density and residual sugar suffers from overplotting. Adding transparency and changing the axis limits, I can see a strong correlation between the two variables. I wonder what this plot will look like if I colour the points by quality.
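One way to handle overplotting along these lines is sketched below in base R. The data is simulated, and both the 1st to 99th percentile cut-offs and the 10% opacity are assumptions chosen for the example rather than the settings used for the plot above.

```r
set.seed(7)
n <- 2000

# Toy data only: density rising loosely with sugar, plus a few extreme values.
sugar   <- c(rexp(n - 5, rate = 1 / 6), 40, 45, 50, 60, 65)
density <- 0.990 + 0.0004 * sugar + rnorm(n, sd = 0.001)

# Axis limits that drop the most extreme 1% at each end (assumed cut-offs).
x_lim <- quantile(sugar, c(0.01, 0.99))
y_lim <- quantile(density, c(0.01, 0.99))

# Semi-transparent points make dense regions appear darker.
plot(sugar, density, pch = 16,
     col = adjustcolor("black", alpha.f = 0.1),
     xlim = x_lim, ylim = y_lim,
     xlab = "Residual sugar (toy values)", ylab = "Density (toy values)")
```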
Residual sugar also seems to be correlated with alcohol content. Let me check how alcohol content is related to density.
Alcohol and density are also strongly correlated. I can clearly see that as density increases, alcohol content decreases.
The total sulfur dioxide in the wine also seems to influence density a lot.
In this section I observed that quality is most correlated with alcohol, followed by density. Quality is also correlated with chlorides and pH. Other variables such as residual sugar and total sulfur dioxide do not seem to affect the quality of the wine directly, but they influence density.
The most interesting relationship was between residual sugar and alcohol. They appear to be negatively correlated, which makes sense, as the fermentation process by which wines are made converts sugar into alcohol.
Another interesting relationship was between alcohol and density. The density of the wine was lowest when the alcohol content was highest.
The strongest relationship was found between density and residual sugar. Density of the wine almost entirely depends on the amount of sugar in the wine. This also explains the strong correlation between density and quality.
In the plots above it is getting difficult to see the relationship between the variables. I’d like to create some categorical bucket variables from the existing variables to make visualising the data easier.
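# Bucket the continuous variables with cut(). The *.breaks vectors and the label
# vectors (sglb, quality.labels) are defined earlier in the analysis and are not shown here.
df$density.bucket <- cut(df$density, breaks=density.breaks)
df$sugar.bucket <- cut(df$residual.sugar, breaks=sugar.breaks, labels=sglb)
df$alcohol.bucket <- cut(df$alcohol, breaks=alcohol.breaks)
df$chlorides.bucket <- cut(df$chlorides, breaks=chlorides.breaks)
# quality is a factor, so as.numeric() returns its level codes (1 to 7), not the scores (3 to 9).
df$quality.bucket <- cut(as.numeric(df$quality), breaks=quality.breaks, labels=quality.labels)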
## 'data.frame': 4898 obs. of 17 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ density.bucket : Factor w/ 4 levels "(0.98,0.992]",..: 4 2 3 3 3 3 3 4 2 2 ...
## $ sugar.bucket : Factor w/ 2 levels "Low","High": 2 1 2 2 2 2 2 2 1 1 ...
## $ alcohol.bucket : Factor w/ 4 levels "(0,9.5]","(9.5,10.4]",..: 1 1 2 2 2 2 2 1 1 3 ...
## $ chlorides.bucket : Factor w/ 4 levels "(0,0.036]","(0.036,0.043]",..: 3 3 3 4 4 3 3 3 3 3 ...
## $ quality.bucket : Factor w/ 3 levels "poor","average",..: 2 2 2 2 2 2 2 2 2 2 ...
Now I have five new categorical variables, as seen above. I split alcohol, density, and chlorides at their quartile values, while I split residual sugar into two buckets to emphasise its bimodal distribution. I also created a quality bucket:
quality < 6 : Poor
quality = 6 : Average
quality 7 and above : Great
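The bucketing scheme described above could be set up along the following lines. This is a sketch on simulated data, not the exact break vectors used in the report: the quartile breaks are recomputed with `quantile()`, the sugar split at 3 g/dm^3 follows the threshold given later in the report, and the quality breaks are written in terms of the original scores rather than factor level codes.

```r
set.seed(3)
n <- 100

# Toy data frame; column names mirror the dataset, values are made up.
df <- data.frame(
  alcohol        = runif(n, 8, 14.2),
  residual.sugar = runif(n, 0.6, 20),
  quality        = factor(sample(3:9, n, replace = TRUE), levels = 3:9)
)

# Quartile-based buckets; include.lowest keeps the minimum value in a bucket.
alcohol.breaks <- quantile(df$alcohol, probs = seq(0, 1, 0.25))
df$alcohol.bucket <- cut(df$alcohol, breaks = alcohol.breaks,
                         include.lowest = TRUE)

# Two sugar buckets split at 3 g/dm^3.
df$sugar.bucket <- cut(df$residual.sugar, breaks = c(0, 3, Inf),
                       labels = c("Low", "High"))

# Recover the scores from the factor, then bucket: 3-5 poor, 6 average, 7-9 great.
quality_num <- as.numeric(as.character(df$quality))
df$quality.bucket <- cut(quality_num, breaks = c(2, 5, 6, 9),
                         labels = c("poor", "average", "great"))

print(table(df$quality, df$quality.bucket))
```

Working from the recovered scores avoids having to remember that level code 4 corresponds to a score of 6.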
Let me use these new variables in my analysis.
Quality was correlated with the density of the wine. In the above plot, it is clear that density and alcohol content are very much dependent on the residual sugar.
I want to use multivariate analysis to get an insight into how these variables interact with each other to influence quality.
Facet wrapping the plot by quality, I can see that the median of density in white wines decreases with quality.
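The trend in median density can also be checked numerically. The sketch below uses simulated data in which density is made to drift downward with quality, so it only illustrates the mechanics; `median_density` is a name introduced for this example.

```r
set.seed(11)

# Toy data only: 40 made-up wines per quality level, with density
# constructed to decrease as the score rises.
score <- rep(3:9, each = 40)
df <- data.frame(
  quality = factor(score, levels = 3:9),
  density = 0.998 - 0.0008 * (score - 3) + rnorm(length(score), sd = 0.002)
)

# Median density within each quality level.
median_density <- tapply(df$density, df$quality, median)
print(round(median_density, 4))

# The same comparison as a boxplot, one box per quality level.
boxplot(density ~ quality, data = df,
        xlab = "Quality", ylab = "Density (toy values)")
```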
The first plot confirms my belief that alcohol content and density greatly affect the quality of white wine. Facet wrapping the plot by residual sugar, the amount of influence of residual sugar on the wine quality is clear.
I think I now have a clear understanding of what affects the quality of wines the most. But before reaching a definite conclusion, I’d like to check if I am missing correlations with other variables.
I facet wrapped the above plots by quality. This gave me a quick way to review the relationships.
The three charts above have some interesting relationships which may affect the quality of white wine. For example, from the first chart, the variance in volatile acidity decreases with increasing quality. The same is true for total sulfur dioxide. Unfortunately, these plots are not as noteworthy as some other relationships found, and a confident model cannot be created using these features.
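A per-level variance of the kind mentioned above can be tabulated with `aggregate()`. The data below is simulated so that the spread shrinks with quality by construction; it illustrates the calculation only and does not reproduce the charts.

```r
set.seed(5)

# Toy data only: 30 made-up wines per quality level, with noise that
# narrows as the score rises.
score    <- rep(3:9, each = 30)
noise_sd <- 0.12 - 0.012 * (score - 3)
df <- data.frame(
  quality          = factor(score, levels = 3:9),
  volatile.acidity = 0.28 + rnorm(length(score), sd = noise_sd)
)

# Variance of volatile acidity within each quality level.
variance_by_quality <- aggregate(volatile.acidity ~ quality, data = df,
                                 FUN = var)
print(variance_by_quality)
```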
I think I now have a pretty good understanding of which features influence the quality of the wine the most. With the strongest correlations, these features are of most interest:
1. Alcohol Content
2. Density
3. Residual Sugar
There are other features that may influence the quality, but there isn’t enough data to reach a conclusion. Some of these features are:
1. Chlorides
2. Total Sulfur dioxide
3. Sulphates
4. pH
In this part of my investigation I was able to strengthen my beliefs about the variables that influenced quality. I was able to get insights into density, alcohol content, and residual sugar, which had the most significant correlation with quality.
I was able to get a deeper insight into those variables, mainly that the relationship between density, alcohol, and residual sugar remains consistent across quality levels. Some other variables, like volatile acidity and total sulfur dioxide, had some correlation with quality, but the relationships did not seem strong enough.
In the univariate section of this report, I assumed that density wouldn’t affect quality much. But I quickly learned that this assumption was incorrect. Density is highly correlated with alcohol content, and from the very beginning, it was apparent that wines with higher alcohol content were of better quality. This was surprising because alcohol itself doesn’t have any taste.
I did not create any linear model as I feel there isn’t enough data to make a confident model. There are only five observations of highest quality wine, which is too little to determine a relationship.
This plot represents a key finding in my report: the quality of white wine is significantly correlated with its density and its alcohol content. These were the two strongest correlations with quality, with density at -0.31 and alcohol at 0.44.
In the box plot at the top, the downward trend in the density of the wine tells me that the quality of wine increases as the density decreases. I changed the limits on the y-axis to exclude the bottom 1% and the top 1% of the data to make it easier to see this trend. In the second boxplot, I see the relationship between alcohol content and quality. I can see an upward trend; however, there is a dip in alcohol content before it starts to rise again.
This dip makes me believe that there is a certain threshold for alcohol percentage before it starts to have a positive impact on quality. I also suspect that if alcohol content keeps rising, quality will start to dip again, as higher concentrations of alcohol impart a bitter taste to wines.
In this plot I see the relationship between the three most significant variables, specifically the relationship between density, alcohol, and residual sugar. I split the sugar values in two to better visualize the data. Wines with residual sugar less than 3g/dm^3 are “Low sugar”, and those with more than that are “High sugar”.
These three variables show the highest amount of correlation. As alcohol content increases, residual sugar and density decrease. This makes sense, as during the fermentation process, sugar is converted to alcohol. Since dissolved sugar makes the wine denser and alcohol is lighter than water, the density keeps decreasing.
This relationship may also hint at how long the wine was fermented: a longer fermentation converts more sugar, which gives the wine a higher alcohol content and a lower density. The dataset has no age variable, so the idea that older wines are of better quality is only a conjecture on my part.
This plot shows the relationship between all our important variables. I have facet wrapped the plot based on the quality of the wine, and added some jitter to the scatter plot to make it easier to visualize the data.
The relationship between residual sugar and density is very clear in the plot, with high sugar wines having a higher median density than low sugar wines. By looking at the colour distribution of the points, I can also observe that great quality wines have the highest concentration of high alcohol wines, while poor quality wines have the highest concentration of high sugar, low alcohol wines.
Hence I can conclude that sweet wines are in fact not of great quality, and that the quality of wines is determined mostly by its alcohol content, and by extension its density and age.
To say this project was challenging would be an understatement. I had to face a lot of obstacles while completing this project, which only made my interest in data analytics grow. To put it simply, this project taught me how to look at data and how to translate it into visualizations to better understand it. I became more familiar with the various types of plots, like scatter plots, box plots, histograms, and, even though I did not make use of them in this project, line plots.
The univariate and bivariate sections were simple to do. I only had to look at the data distribution and its relationships with other important variables. The multivariate section, though, proved to be challenging. Even though I had an idea of what my conclusion would be, it was difficult to express it by way of plots. The dataset only had one factor variable, and it was getting difficult to make simple plots with just the quality. Daniel Wolf’s white wine exploration gave me the idea of splitting some variables like sugar and alcohol into buckets and making new categorical variables.
This project made me realise that real world data will not always have the variables I need, and that it is important to be creative and make new variables as and when required.